Nature Machine Intelligence
Springer Science and Business Media LLC
Preprints posted in the last 30 days, ranked by how well they match Nature Machine Intelligence's content profile, based on 61 papers previously published here. The average preprint has a 0.13% match score for this journal, so anything above that is already an above-average fit.
HOU, Z.; Lee, V. H.-F.; Kwong, D. L.-W.; Guan, X.; Liu, Z.; Dai, W.
The advent of artificial intelligence (AI) has brought revolutionary tools for biomedical transcriptomic (RNA-level) research. However, persistent constraints remain, including limited interpretability in terms of biomedical concepts such as functional pathways, small sample sizes, and the substantial time and computing power required for AI training. To overcome these limitations, we developed RNAGAN (https://github.com/ZhaozhengHou-HKU/RNAGAN-1.0.git), an AI tool with a generative adversarial network (GAN) structure designed to enhance transcriptomic analysis. The network was established on public human datasets comprising 4.6 million single cells from multiple organs and 5,900 sequenced samples of various cancer types with normal references. A specialized pathway neural layer was embedded to extract activities of predefined pathways from the Human Molecular Signatures Database (MSigDB), or of newly learned pathways from single-cell data. The structure of RNAGAN (generator and discriminator) enables four applications after one shared training procedure: 1. single-cell and bulk-level patient stratification or differential diagnosis; 2. analysis of gene and pathway markers in a selected disease; 3. pseudo-data generation when sample size is limited for downstream analysis; 4. vectorization with gene- and pathway-level features learned from multiple datasets. RNAGAN contributes to the efficient utilization of limited data for transcriptomic studies.
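The pathway neural layer described in the abstract above can be pictured as a masked aggregation: each output unit sums only over the genes of one predefined pathway. The sketch below illustrates that idea; the gene sets, weights, and expression values are hypothetical, not taken from MSigDB or from RNAGAN itself.

```python
# Minimal sketch of a pathway-constrained layer: each pathway activity
# aggregates only its member genes. All gene sets and weights here are
# illustrative placeholders, not the model's actual parameters.

def pathway_activities(expression, pathways, weights):
    """expression: {gene: value}; pathways: {name: [genes]};
    weights: {(pathway, gene): w}. Returns {pathway: activity}."""
    acts = {}
    for name, genes in pathways.items():
        # Only member genes contribute -- the mask that imposes
        # biological structure on the layer.
        acts[name] = sum(weights.get((name, g), 0.0) * expression.get(g, 0.0)
                         for g in genes)
    return acts

expr = {"TP53": 2.0, "MYC": 1.0, "CDK1": 0.5}
pws = {"cell_cycle": ["MYC", "CDK1"], "apoptosis": ["TP53"]}
w = {("cell_cycle", "MYC"): 1.0, ("cell_cycle", "CDK1"): 2.0,
     ("apoptosis", "TP53"): 0.5}
acts = pathway_activities(expr, pws, w)
```

In a trained network the weights would be learned; the masking is what makes each hidden unit interpretable as one pathway's activity.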
Dee, W.; Wenteler, A.; Seal, S.; Morris, O.; Slabaugh, G.
Pervasive batch effects are a common issue, especially in recent large-scale Cell Painting datasets produced to aid AI-enhanced drug discovery efforts. Technical differences arising from experiments carried out in different batches can cause models to fail to generalize to unseen batches, despite good predictive performance within batch. We propose a biologically grounded test-time adaptation framework, SHOT-CCR, which uses cell-invariant gradient reversal to decouple morphological signal from experimental confounders. Our approach performs 4.5% better than the current RxRx1 benchmark, classifying 1,139 classes of siRNA genetic perturbations with 91.6% accuracy. We deliver consistent results over four distinct cell types and two prominent Cell Painting datasets, RxRx1 and a subset of JUMP-CP. Across 484 classes of CRISPR perturbations in JUMP-CP, our method improves accuracy by 15.7%.
Lu, H.-E.; Koivisto, D.; Lou, Y.; Zeng, Z.; Yu, T.; Wang, J.; Meng, X.; Nowikow, C.; Wilson, R.; Kumbhare, D.; Pu, J.
Deep learning has transformed medical image and video analysis, but it usually requires large, well-annotated datasets. In many clinical domains, especially when testing novel mechanistic hypotheses, such retrospective datasets are hard to obtain, since acquiring adequate cohorts is time-intensive, costly, and operationally difficult. This creates a critical translational gap: scientifically compelling early-stage ideas may remain untested for lack of the sample sizes that conventional deep learning pipelines require. Developing data-efficient strategies for evaluating new hypotheses within small prospective cohorts is therefore essential to de-risk innovation before large-scale validation. Myofascial Pain Syndrome (MPS) exemplifies this challenge, as quantitative ultrasound imaging biomarkers for MPS remain underexplored. We investigated whether MPS in the upper trapezius can be detected from full B-mode ultrasound videos in a small prospective cohort (11 controls, 13 patients). Videos were automatically preprocessed and resampled using a sliding-window strategy to expand the training samples (404 clips). A self-supervised Video Diffusion Encoder (VDE) was developed to learn spatiotemporal representations without relying on extensive labeled data and was compared with transfer-learning-based ResNet, VideoMAE, and SimCLR. Using subject-level stratified four-fold cross-validation, the VDE outperformed the transfer-learning baselines and achieved performance comparable to SimCLR, with a subject-level AUC of 0.79 and accuracy of 0.86, and no significant differences between latent-only and combined trigger-point analyses. These results demonstrate that self-supervised diffusion learning can support robust, data-efficient deep learning in small prospective studies, enabling early feasibility testing of innovative ultrasound biomarkers before large-scale clinical trials.
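The sliding-window resampling mentioned above is a simple clip-expansion trick; a sketch follows. The window and stride values are illustrative, not the study's actual settings.

```python
# Sketch of sliding-window resampling: one long video becomes many
# fixed-length, overlapping training clips. Window/stride are examples.

def sliding_clips(n_frames, window, stride):
    """Return (start, end) frame indices of every full-length clip."""
    return [(s, s + window) for s in range(0, n_frames - window + 1, stride)]

clips = sliding_clips(n_frames=100, window=16, stride=8)
```

With 100 frames, a 16-frame window, and 50% overlap this yields 11 clips from a single recording, which is how a 24-subject cohort can still produce hundreds of training samples.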
Shen, L.; Chao, L.; Liu, T.; Liu, Q.; Zhou, G.; Wang, H.; Dong, X.; Li, T.; Zhang, X.; Ni, J.
While protein language models typically rely on sequence-only pretraining objectives, this approach often fails to capture structural regularities and demands large amounts of computation. To address this, we introduce ProteinSage, a pretraining framework that learns protein representations under explicit structural constraints. ProteinSage incorporates structural signals via structure-guided masking and a causal objective designed to model long-range dependencies. This structure-constrained pretraining endows ProteinSage with highly transferable representations that achieve superior performance across diverse structure-aware and general protein modeling benchmarks, while requiring substantially less computation. To determine whether these gains stem from genuine structural generalization rather than task-specific fitting, we applied ProteinSage to a structure-driven protein discovery task, focusing on proteins with multi-pass transmembrane helical architectures such as distantly related microbial rhodopsins. The model successfully identified six previously unannotated microbial rhodopsin homologs. Together, our work establishes structure-constrained pretraining as an effective pathway toward data-efficient and structurally faithful protein representation learning.
Peddi, N.; Bijjula, D. R.; Gogte, S.; Kondaparthi, V.
Major Histocompatibility Complex (MHC) molecules are essential to the immune system because they bind and present peptide antigens to T cells, enabling immune recognition and response. The specificity of MHC-peptide interactions is crucial for understanding immune-related diseases, developing personalized immunotherapies, and designing effective vaccines. Current computational methods, while powerful, often rely on a single type of molecular information, usually sequence, and only implicitly model the interaction between the two molecules. To address these limitations, we introduce MHC-Bind, a novel deep learning framework that captures a more comprehensive and biologically relevant view of the binding event. MHC-Bind's architecture employs a dual-view feature extraction strategy for both the MHC and the peptide. A Graph Attention Network (GAT) learns topological features from predicted residue contact maps, while a parallel 1D Convolutional Neural Network (CNN) captures multi-scale patterns from sequence embeddings. These four distinct feature sets are then integrated in a cross-fusion module that uses an attention mechanism to model interactions between the two molecules. Finally, a multi-layer perceptron (MLP) regression head maps the fused interaction signature to a precise binding affinity score. In rigorous comparative benchmarks against recent methods such as NetMHCpan, MHCFlurry, and MHCnuggets, MHC-Bind demonstrates superior performance, achieving a significantly lower average prediction error (RMSE: 0.1485) and a higher correlation (PCC: 0.7231) in allele-specific contexts. For pan-allele tasks, it excels at correctly ranking peptides with a superior Spearman's correlation (SCC: 0.7102), a crucial advantage for practical applications. The framework's design is inherently flexible, excelling in both allele-specific and pan-allele prediction tasks.
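The residue contact maps that feed the GAT branch above are typically built by thresholding pairwise distances between residue coordinates. A minimal sketch, with illustrative coordinates and an illustrative 4.0 Å cutoff (contact-map cutoffs vary by method):

```python
# Sketch of a residue contact map: residues are graph nodes, and an edge
# joins any pair whose coordinates lie within a distance cutoff.
import math

def contact_map(coords, cutoff=4.0):
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                adj[i][j] = adj[j][i] = 1
    return adj

# Three residues on a line, 3.8 apart (a typical C-alpha spacing).
adj = contact_map([(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)])
```

Only adjacent residues fall under the cutoff here; the resulting adjacency matrix is what a graph attention layer would consume.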
Colangelo, G.; Marti, M.
The space of possible phenotype profiles over the Human Phenotype Ontology (HPO) is combinatorially vast, whereas the space of candidate disease genes is far smaller. Phenotype-driven diagnosis is therefore highly non-bijective: many distinct symptom profiles can correspond to the same gene, but only a small fraction of the theoretical phenotype space is biologically and clinically plausible. When a structured ontology exists, this constraint can be exploited to generate realistic synthetic cases. We introduce GraPhens, a simulation framework that uses gene-local HPO structure together with two empirically motivated soft priors, over the number of observed phenotypes per case and phenotype specificity, to generate synthetic phenotype-gene pairs that are novel yet clinically plausible. We use these synthetic cases to train GenPhenia, a graph neural network that reasons over patient-specific phenotype subgraphs rather than flat phenotype sets. Despite being trained entirely on synthetic data, GenPhenia generalizes to real, previously unseen clinical cases and outperforms existing phenotype-driven gene-prioritization methods on two real-world datasets. These results show that when patient-level data are scarce but a structured ontology is available, principled simulation can provide effective training data for end-to-end neural diagnosis models.
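The simulation idea above (draw a plausible number of phenotypes per case, then sample terms tied to a gene) can be sketched in a few lines. Everything below is hypothetical: the HPO term IDs, the gene-to-term map, and the exponential stand-in for the paper's learned soft priors.

```python
# Sketch of ontology-constrained synthetic case generation: sample a
# phenotype count from a soft prior, then sample that many HPO terms
# associated with the target gene. All data and priors are placeholders.
import random

def synth_case(gene, gene_to_terms, rng, mean_terms=4):
    terms = gene_to_terms[gene]
    # Soft prior over phenotype count: an exponential draw clamped to
    # the available terms (a stand-in for the learned priors).
    k = min(len(terms), max(1, int(rng.expovariate(1.0 / mean_terms)) + 1))
    return rng.sample(terms, k)

rng = random.Random(0)
g2t = {"GENE_X": ["HP:0001", "HP:0002", "HP:0003", "HP:0004", "HP:0005"]}
case = synth_case("GENE_X", g2t, rng)
```

Each synthetic (phenotype set, gene) pair then serves as a labeled training example for the downstream gene-prioritization model.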
Cao, K.; Li, R.; Strazar, M.; Brown, E. M.; Nguyen, P. N. U.; Pust, M.-M.; Park, J.; Graham, D. B.; Ashenberg, O.; Uhler, C.; Xavier, R.
The interaction between T cell receptors (TCRs), peptides, and human leukocyte antigens (HLAs) underlies antigen-specific T cell immunity. Despite substantial advances in peptide-HLA presentation prediction, accurate modeling of coupled TCR-peptide-HLA recognition remains underdeveloped, limiting applications such as TCR and neoepitope prioritization in cancer and antigen identification in autoimmunity. Here we present StriMap, a unified framework for predicting TCR-peptide-HLA interactions by integrating physicochemical, sequence-context, and structural features at recognition interfaces. StriMap achieves state-of-the-art performance with improved generalizability and enables applications in both cancer and autoimmunity. As a case study in ankylosing spondylitis (AS), we screened 13 million peptides derived from 43,241 bacterial proteins and identified candidate molecular mimics that were experimentally validated to activate T cells expressing an AS-associated TCR. Notably, a top validated peptide was enriched in patients with inflammatory bowel disease (IBD), suggesting potential shared microbial triggers between AS and IBD. Overall, StriMap provides a generalizable framework for rational immunotherapy design and for dissecting antigenic drivers of autoimmunity.
Bian, B.; Zhang, Y.; Zhang, J.; Asai, K.; Saito, Y.
mRNA coding sequence design is a critical component in the development of mRNA vaccines, nucleic acid therapeutics, and heterologous gene expression systems. While large language models have recently been successfully applied to protein design and RNA modeling, designing optimal mRNA coding sequences for a given protein, particularly in a species-specific manner, remains a major challenge. Here, we present Pro2RNA, a multimodal reverse-translation language model that generates mRNA coding sequences from their corresponding protein sequences while explicitly conditioning on host organism taxonomy information. Pro2RNA integrates multiple pretrained language models across different modalities, including ESM2 for protein representation, SciBERT for taxonomy understanding, and a generative RNA language model for mRNA codon-level sequence generation. By training on mRNA-protein pairs from eukaryotic and bacterial datasets, respectively, Pro2RNA learns species-dependent genetic codes and codon usage patterns, enabling the generation of host-adapted and natural-like mRNA coding sequences. Across multiple benchmark evaluations, Pro2RNA matches or surpasses existing optimization methods, demonstrating its potential as a powerful and flexible framework for species-aware mRNA coding sequence design.
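The species-conditioning idea above contrasts with the classical baseline it improves on: picking each amino acid's most frequent codon in the host's usage table. A sketch of that baseline follows; the usage frequencies are illustrative, and the actual model generates codons with a language model rather than a lookup.

```python
# Sketch of naive host-aware reverse translation: choose each amino
# acid's most frequent codon for one host species. Frequencies below
# are illustrative placeholders, not real codon-usage data.

def reverse_translate(protein, codon_usage):
    """codon_usage: {aa: {codon: freq}} for one host organism."""
    return "".join(max(codon_usage[aa], key=codon_usage[aa].get)
                   for aa in protein)

usage_host = {  # hypothetical usage table for one host
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.74, "AAG": 0.26},
    "F": {"TTT": 0.58, "TTC": 0.42},
}
cds = reverse_translate("MKF", usage_host)
```

Swapping in a different host's usage table changes the output sequence, which is exactly the degree of freedom the model learns to exploit more flexibly.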
Znabu, B. F.; Atif, Z.
Hepatocellular carcinoma (HCC) is a leading cause of cancer mortality worldwide, yet existing prognostic models incompletely capture its molecular heterogeneity. We developed an interpretable, attention-based multi-branch deep learning framework for multi-omics survival prediction in HCC. Using 358 TCGA LIHC patients with matched mRNA expression, miRNA expression, and DNA methylation data, we first reproduced the Chaudhary et al. autoencoder-based survival model as a baseline (C-index = 0.561, log-rank p = 3.10 × 10⁻²). We then designed a multi-branch architecture with omics-specific encoders, multi-head attention fusion, and Cox partial likelihood training, optimized via Bayesian hyperparameter search (100 Optuna trials). In 5-fold stratified cross-validation with nested feature selection (no data leakage), our attention model achieved a mean C-index of 0.683 ± 0.039, outperforming the autoencoder baseline (0.561) and a clinical-only model (0.637), and performing similarly to an AUTOSurv-like benchmark (0.697). Branch dropout enabled single-omics inference; external validation on the real GSE14520 cohort (n=221, mRNA) achieved a C-index of 0.637 (p = 0.004), comparable to Chaudhary et al.'s reported 0.67 on the same data. Integrated gradients and attention weights highlighted features with prior links to HCC biology, including cell cycle genes (CCNA2, PLK1) and a Wnt pathway component (FZD7), along with candidate biomarkers stable across all cross-validation folds (PZP, SGCB, CD300LG, ZNF831 for mRNA; 12 miRNAs; 6 CpG sites). Differential expression analysis between model-defined risk groups identified 381 significant genes (Bonferroni p < 0.05), though this analysis is partly circular. Multivariable Cox regression indicated that the model-derived risk score adds prognostic value beyond clinical variables, with consistent performance across clinical subgroups, though clinical integration metrics were evaluated on training data. This framework provides a transparent, biologically grounded approach to multi-omics prognostication in HCC.
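The C-index reported throughout the entry above is the concordance index: the fraction of comparable patient pairs in which the higher-risk patient actually fails earlier. A minimal sketch with illustrative data (ignoring ties in time, as a simplification):

```python
# Sketch of the concordance index used to evaluate survival models.
# Times, event flags, and risk scores below are toy values.

def c_index(times, events, risks):
    """Fraction of comparable pairs ordered correctly by risk.
    events[i]=1 means patient i's failure time was observed."""
    conc = comp = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable if i has an observed event before t_j.
            if events[i] and times[i] < times[j]:
                comp += 1
                if risks[i] > risks[j]:
                    conc += 1          # higher risk failed first: concordant
                elif risks[i] == risks[j]:
                    conc += 0.5        # tied risks count half
    return conc / comp

ci = c_index(times=[2, 4, 6], events=[1, 1, 1], risks=[3.0, 2.0, 1.0])
```

A perfectly ordered risk score gives 1.0, random scores hover near 0.5, which is why values like 0.68 represent a meaningful but modest gain over the 0.56 baseline.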
Lteif, D.; Jia, S.; Bit, S.; Kaliaev, A.; Mian, A. Z.; Small, J. E.; Mangaleswaran, B.; Plummer, B. A.; Bargal, S. A.; Au, R.; Kolachalama, V. B.
Structural magnetic resonance imaging (MRI) is a cornerstone for diagnosing neurological disorders, yet automated interpretation of multi-sequence brain MRI remains limited by challenges in cross-sequence reasoning and protocol variability. Here we present ReMIND, a vision-language modeling framework tailored for comprehensive multi-sequence and multi-volumetric brain MRI analysis. Trained on over 73,000 deidentified patient visits encompassing more than 850,000 MRI sequences paired with radiology reports from diverse clinical and research cohorts, ReMIND combined large-scale instruction tuning on more than one million clinically grounded question-answer (QA) pairs with targeted supervised fine-tuning for radiology report generation. At inference, ReMIND employed modality-aware reranking and correction, a report-level decoding strategy that suppressed unsupported modality claims while preserving linguistic fluency and clinical coherence. Cross-cohort generalization was maintained on independent external datasets from different institutions. These findings represent an advance toward consistent and equitable brain MRI interpretation, meriting prospective evaluation to support diagnosis and management of neurological conditions.
Hochner-Vilk, T.; Stein, D.; Schueler-Furman, O.; Raveh, B.; Chook, Y. M.; Schneidman-Duhovny, D.
Domain-peptide interactions mediate a significant fraction of cellular protein networks, yet accurately predicting their specificity remains challenging. Peptide motifs typically have short, fuzzy sequence profiles, and their interactions are often weak and transient, limiting the size, coverage, and quality of experimentally validated domain-peptide datasets. Since true non-binders are rarely known, constructing negative examples often introduces bias. While structure-based prediction methods can achieve high accuracy, they are computationally demanding and difficult to scale to the proteome level. We introduce CLIPepPI, a dual-encoder model that leverages contrastive learning to embed domains and peptides into a shared space directly from sequence. Both encoders are initialized from a protein language model (ESM-C) and fine-tuned using lightweight LoRA adapters, enabling parameter-efficient training on positive pairs alone. To overcome data scarcity, we augment ~3K protein-peptide complexes from PPI3D with ~150K domain-peptide pairs derived from protein-protein interfaces. CLIPepPI further injects structural information by marking interface residues in the domain sequence, thus guiding the encoders toward binding regions and linking sequence-level learning with structural context. Competitive performance is achieved across three independent benchmarks: domain-peptide complexes from PPI3D, large-scale phage-library data from ProP-PD, and a curated dataset of nuclear export signal (NES) sequences. We demonstrate scalability and generalization through two applications: (i) proteome-wide NES scanning, and (ii) variant-effect prediction, where score changes in domain-peptide interactions between wild-type and mutant sequences discriminate pathogenic from benign variants. Together, CLIPepPI offers a scalable, structure-informed model for predicting domain-peptide specificity and generating meaningful embeddings suited for large-scale proteomic analyses.
CLIPepPI is available at: https://bio3d.cs.huji.ac.il/webserver/clipeppi/.
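The contrastive objective behind dual-encoder models like the one above pulls each domain embedding toward its own peptide while treating the rest of the batch as negatives. A sketch of that symmetric InfoNCE-style loss on toy embeddings (the vectors and temperature are illustrative):

```python
# Sketch of an in-batch contrastive loss for a dual encoder: each domain
# should match its own peptide against all others in the batch.
# Embeddings and temperature are toy values.
import math

def info_nce(domain_embs, peptide_embs, tau=0.5):
    """Mean cross-entropy of matching each domain to its own peptide."""
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)
    loss = 0.0
    for i, d in enumerate(domain_embs):
        logits = [cos(d, p) / tau for p in peptide_embs]
        z = sum(math.exp(l) for l in logits)
        loss += -math.log(math.exp(logits[i]) / z)
    return loss / len(domain_embs)

aligned = info_nce([(1, 0), (0, 1)], [(1, 0), (0, 1)])
mismatched = info_nce([(1, 0), (0, 1)], [(0, 1), (1, 0)])
```

Because negatives come for free from the batch, only positive pairs need to be curated, which is what makes training on positives alone feasible.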
Pilz, M.; Scheid, J.; Bauer, A.; Lemke, S.; Sachsenberg, T.; Bauer, J.; Nelde, A.; Stadelmaier, J.; Walter, A.; Rammensee, H.-G.; Nahnsen, S.; Kohlbacher, O.; Walz, J. S.
The immune system eliminates malignant and infected cells through T-cell-mediated recognition of peptides presented by human leukocyte antigen molecules. Mass spectrometry-based immunopeptidomics enables unbiased identification of naturally presented HLA-restricted peptides and has become central to the development of T-cell-based immunotherapies. However, immunopeptidomics data reflects the combined peptide presentation of multiple HLA alleles, and determining which allotypes are represented in this multi-allelic complexity remains an unmet computational challenge. Here, we introduce immunotype, a deep learning-based ensemble predictor for HLA class I allotyping directly from immunopeptidomics data. Immunotype integrates peptide and HLA protein sequence information through transformer encoders and a graph neural network, complemented by a curated mono-allelic reference of known peptide-HLA binding preferences. Immunotype achieves an overall accuracy of 87.2% at protein-level resolution across diverse tissues and thereby enables rapid, cost-effective HLA typing of large-scale immunopeptidomics datasets.
Hornak, G.; Heinolainen, A.; Solyomvari, K.; Silen, S.; Renkonen, R.; Koskinen, M.
Selecting an effective treatment relies on accurately anticipating a patient's response to alternative interventions. However, forecasting longitudinal clinical trajectories remains difficult because electronic health records contain heterogeneous, irregularly sampled data over extended time periods. These issues are especially relevant for laboratory measurements, which are central to diagnostics, assessment of therapeutic responses, and tracking disease progression in routine clinical practice. However, existing deep learning methods for counterfactual prediction usually assume regularly sampled data, an assumption incompatible with the irregular, heterogeneous data-generation processes of real-world clinical practice. Here we present the Time-Aware G-Transformer, which integrates causal G-computation with time-aware attention to predict counterfactual outcomes on irregular data. By explicitly conditioning on the timing of future observations and encoding measurement patterns, the model captures temporal dynamics that previous methods overlook. Evaluated on synthetic tumor growth data and on 90,753 cancer patient trajectories from an academic medical center, our approach demonstrates superior long-horizon (> 1 day) prediction accuracy and uncertainty calibration compared to state-of-the-art baselines. These results demonstrate that embedding temporal relations directly into the attention mechanism enables robust integration of patient history data for evaluating potential treatment strategies in personalized medicine.
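One simple way to embed temporal relations into attention, in the spirit of the entry above, is to penalize each observation's attention logit by its age, so stale measurements count less. The decay term below is an illustrative stand-in for the model's learned time encoding.

```python
# Sketch of time-aware attention weighting: subtract an age penalty from
# each attention logit before the softmax. Decay rate is illustrative.
import math

def time_aware_weights(scores, ages_hours, decay=0.1):
    logits = [s - decay * a for s, a in zip(scores, ages_hours)]
    m = max(logits)                      # stabilize the softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Two lab values with equal content scores: the recent one dominates.
w = time_aware_weights(scores=[1.0, 1.0], ages_hours=[48.0, 2.0])
```

With equal content scores, the 2-hour-old measurement receives far more weight than the 48-hour-old one, which is the behavior regular-grid attention cannot express.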
Hesse, J.; Schum, D.; Leidel, L.; Gareis, L. R.; Herrmann, J.; Müller, R.; Sieber, S. A.
Antibiotic resistance continues to rise, yet most new drug candidates act through long-established targets. Faster mode of action (MoA) assessment would enable more effective prioritization of screening hits and help identify compounds with novel mechanisms. In this study, we aimed to develop a scalable framework for MoA inference from antibiotic-induced cellular response profiles in Escherichia coli. We generated a multimodal dataset spanning more than 50 antibiotics, including proteome profiles, chemical structure descriptors, inhibitory concentrations and growth dynamics, and used it to build MAPPER (Mode of Action Prediction via Proteomics-Enhanced Representation), a framework comprising a fixed multimodal predictor and an uncertainty module. MAPPER accurately classified antibiotics across nine mechanistic classes, flagged compounds with likely novel mechanisms and retained predictive power in proteomics-only transfer experiments across mass spectrometry platforms and external data. Together, these results establish MAPPER as an innovative tool for MoA prediction and novelty detection, enabling prioritization of antibacterial candidates with distinct mechanisms.
Xiao, M.; He, Y.; Hu, J.; Zou, F.; Zou, B.
Perturbational transcriptomics links therapeutic compounds to cellular mechanisms and provides a powerful framework for drug discovery, but experimentally profiling transcriptional responses across diverse cell states, doses and durations is costly and often infeasible. Here we present DEPICT (Drug rEsponse Prediction in transCriptomics with Transformers), a deep learning framework that predicts condition-matched drug-induced transcriptional responses from baseline gene expression, perturbation settings and complementary drug representations. Using the LINCS L1000 dataset, DEPICT generalized to unseen drugs and cell types and outperformed five baseline strategies and two recent deep learning models. In the most challenging unseen-cell evaluation, DEPICT was the only model to surpass all baselines, improving differential-expression prediction accuracy and reducing perturbed-expression prediction error by 30.3% and 36.8%, respectively, relative to the next-best deep model. In a non-small cell lung cancer (NSCLC) case study, DEPICT-enabled virtual screening prioritized compounds predicted to reverse disease-associated transcriptional signatures. Notably, 13 of the top 20 prioritized compounds had either previously entered NSCLC-related clinical trials or been validated in NSCLC studies, supporting the translational relevance of the predicted perturbational profiles. DEPICT further enabled condition-matched drug synergy prediction and mechanistic exploration when experimentally matched profiles were unavailable. Together, these results show that accurate, condition-matched in silico perturbation profiling can scale transcriptomics-driven hypothesis generation for drug repurposing and combination discovery.
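The virtual-screening step described above prioritizes compounds whose predicted expression change opposes the disease signature. A common way to score that is a negative-similarity measure between the two signatures; the sketch below uses negative cosine similarity on illustrative gene values (the actual scoring in the paper may differ).

```python
# Sketch of signature-reversal scoring: a drug whose predicted
# transcriptional change anti-correlates with the disease signature is
# a reversal candidate. Gene values are toy numbers.
import math

def reversal_score(disease_sig, predicted_drug_sig):
    """Negative cosine similarity: higher means stronger reversal."""
    dot = sum(a * b for a, b in zip(disease_sig, predicted_drug_sig))
    na = math.sqrt(sum(a * a for a in disease_sig))
    nb = math.sqrt(sum(b * b for b in predicted_drug_sig))
    return -dot / (na * nb)

disease = [1.5, -0.8, 2.0]      # up/down-regulation per gene (toy)
reverser = [-1.4, 0.9, -2.1]    # predicted drug-induced change (toy)
score = reversal_score(disease, reverser)
```

Ranking a compound library by this score, computed against predicted rather than measured profiles, is what turns the response predictor into a screening tool.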
Wang, R.; Jin, K.; Pan, L.
Protein language models (PLMs) are increasingly central to protein engineering and drug discovery. Many high-performing systems, however, rely on large parameter counts, multiple sequence alignments (MSAs), explicit structural inputs, or computationally intensive attention mechanisms, limiting their accessibility and throughput. Here we present AINN-P1, a 167M-parameter protein language model trained exclusively on raw UniRef amino-acid sequences using an autoregressive next-token prediction objective. AINN-P1 employs a multiplicative LSTM (mLSTM) architecture--an attention-free, recurrent design that scales linearly with sequence length and avoids growing key-value caches during inference. We evaluate AINN-P1 on ProteinGym fitness prediction tasks spanning activity, binding, expression, and stability using a frozen-encoder protocol with lightweight few-shot regression heads. Under this protocol, AINN-P1 achieves an average Spearman ρ of 0.441 across four task categories and a Spearman ρ of 0.625 on stability--the highest among sequence-only models in our comparison set. Because our evaluation uses few-shot supervised regression rather than the zero-shot scoring employed by most ProteinGym leaderboard baselines, direct numerical comparison requires caution; we discuss this methodological distinction throughout. Beyond benchmark performance, AINN-P1 emphasizes practical deployability: its recurrent architecture avoids quadratic memory scaling, supports fixed-state inference on long sequences, and enables rapid adaptation through frozen embeddings rather than costly end-to-end fine-tuning. We discuss when sequence-only models are sufficient, when structural information remains beneficial, and how compact foundation models can serve as efficient front-end filters in drug discovery workflows.
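The mLSTM recurrence mentioned above differs from a standard LSTM in one step: an intermediate state mixes the input and previous hidden state multiplicatively before the usual gates. A scalar sketch follows (Krause et al.-style mLSTM; all weights are illustrative scalars, whereas the real model uses learned matrices).

```python
# Scalar sketch of a multiplicative LSTM step: the intermediate state m
# multiplies input and hidden contributions, then feeds the LSTM gates.
# All weights are illustrative placeholders.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlstm_step(x, h, c, w):
    m = (w["mx"] * x) * (w["mh"] * h)        # multiplicative mixing
    i = sigmoid(w["ix"] * x + w["im"] * m)   # input gate
    f = sigmoid(w["fx"] * x + w["fm"] * m)   # forget gate
    o = sigmoid(w["ox"] * x + w["om"] * m)   # output gate
    g = math.tanh(w["gx"] * x + w["gm"] * m) # candidate cell update
    c = f * c + i * g
    return o * math.tanh(c), c

w = {k: 0.5 for k in ["mx", "mh", "ix", "im", "fx", "fm",
                      "ox", "om", "gx", "gm"]}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 2.0]:                   # a toy input "sequence"
    h, c = mlstm_step(x, h, c, w)
```

Because the only state carried forward is (h, c), inference memory is fixed regardless of sequence length, unlike a transformer's growing key-value cache.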
Su, Z.; Wu, Y.
Controlling complex biological systems across multiple scales remains a major challenge in computational medicine, because whole-body disease behavior is closely shaped by noisy cellular events at much smaller scales. Standard deterministic models often miss this molecular variability, while fully stochastic simulations are too slow for the repeated, high-throughput interactions needed to train artificial intelligence. To address this problem, we developed a new AI-based framework that combines a discrete stochastic Gillespie algorithm for microscale receptor dynamics with continuous, nonlinear ordinary differential equations for systemic macroscale behavior. To reach the speed needed for deep reinforcement learning (RL), we compress this hybrid system into a differentiable Neural ODE surrogate that acts as a fast digital twin. As a proof of concept, we applied this framework to engineered cellular therapy and used RL agents to learn dynamic, closed-loop treatment policies inside the surrogate environment. By tracking microscopic, unpredictable cellular activity as an early-warning signal, the AI learned to continuously adjust the drug dose--anticipating and stopping dangerous immune reactions before they could spiral out of control. This computational advance improved successful control rates to more than 70% in highly unstable simulated phenotypes and provides a practical, general framework for adaptive intervention in multiscale biological systems.
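The microscale piece of the hybrid framework above is a Gillespie stochastic simulation. A minimal sketch for a birth-death receptor process (rates, counts, and the seed are illustrative; the paper couples this to macroscale ODEs, which are omitted here):

```python
# Sketch of the Gillespie stochastic simulation algorithm for a simple
# birth-death process: constant-rate binding (birth) and per-molecule
# unbinding (death) of receptors. Rates and counts are toy values.
import random

def gillespie_birth_death(n0, k_on, k_off, t_end, rng):
    t, n = 0.0, n0
    while t < t_end:
        rates = [k_on, k_off * n]            # [birth, death] propensities
        total = sum(rates)
        if total == 0:
            break
        t += rng.expovariate(total)          # exponential waiting time
        if rng.random() < rates[0] / total:  # pick which event fires
            n += 1
        else:
            n -= 1
    return n

rng = random.Random(42)
final = gillespie_birth_death(n0=10, k_on=5.0, k_off=0.5, t_end=50.0, rng=rng)
```

Runs fluctuate around the steady-state mean k_on/k_off; that run-to-run molecular noise is exactly the variability a deterministic ODE would average away, and what the Neural ODE surrogate must learn to reproduce cheaply.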
Cai, J.; Gatz, A. E.; Li, J.; Pal, D.; Tang, H.; Eadon, M. T.; Yang, B.; Meng, L.; Su, J.
Acute kidney injury in sepsis evolves over hours to days, yet most ICU models emphasize onset and provide limited insight into cardio-renal interactions. We developed AKI-twinX, an organ-structured, explainable digital twin that jointly forecasts acute kidney injury onset, acute kidney injury trajectory, and near-term mortality risk. The model learns renal and cardiovascular latent states with sparse feature gating and captures cross-organ coupling with attention. We trained AKI-twinX on MIMIC-IV sepsis using 5-fold cross-validation and evaluated it on an Indiana University Health cohort. Discrimination was consistent across systems (AUC: mortality 0.86-0.88, acute kidney injury onset 0.78-0.82, acute kidney injury trajectory 0.73-0.78). In vasopressor-treated windows, 12-hour systolic blood pressure forecasts tracked observed values (mean absolute error 8.5 mmHg). Counterfactual vasopressor withdrawal shifted predicted blood pressure downward and increased predicted risk, supporting sensitivity to clinically meaningful interventions. AKI-twinX enables trajectory-aware forecasting with bedside auditability in sepsis.
Liu, Y.; Zhang, Z.
Deep learning models utilizing longitudinal healthcare data have significantly advanced epidemiological research. However, contemporary transformer-based models increasingly rely on computationally intensive pre-training steps that entail processing massive real-world datasets with cost-prohibitive hardware. We introduce the Temporal Encoder with Late Fusion (TELF), a lightweight end-to-end predictive model featuring an encoder-only architecture for processing medical codes, followed by post-encoder concatenation with demographic variables. TELF learns code embeddings on-the-fly, thereby bypassing the resource-intensive pre-training bottleneck. Furthermore, its late-fusion design preserves the integrity of the temporal attention mechanism before integrating static demographic predictors. We evaluated TELF using an administrative claims database across three distinct cohorts: pancreatic cancer (n=53,661), type 2 diabetes (n=78,756), and heart failure (n=72,540). TELF consistently outperformed traditional machine learning baselines, including XGBoost, LightGBM, and logistic regression. Specifically, TELF achieved AUCs of 0.9150, 0.8199, and 0.8721 for pancreatic cancer, type 2 diabetes, and heart failure, respectively, compared with 0.9044, 0.7908, and 0.8535 for XGBoost and 0.9014, 0.7800, and 0.8466 for logistic regression. Beyond predictive superiority, TELF's isolated temporal attention mechanism enables population-level motif mining. By extracting high-attention temporal sequences, we mapped aggregated patient journey pathways, revealing interpretable clinical trajectories preceding disease onset. Collectively, these results demonstrate that TELF provides a resource-efficient and accessible framework for advanced temporal modeling in clinical and epidemiological research.
Carrillo Barrera, P.; Babey, A.; Pena, C. A.
The scalability of phage therapy as a viable alternative or complement to antibiotics is limited by the labor-intensive experimental screening required to identify compatible phage-bacterium pairs. To accelerate this discovery process, we propose FoundedPBI, an ensemble deep learning approach that leverages the emergent capabilities of genomic foundation models, large language models pre-trained on vast DNA corpora, to predict phage-bacterium interactions from DNA sequences alone. We employ an ensemble strategy that aggregates outputs from three state-of-the-art DNA language models into a unified meta-embedding, which is then processed by a neural classifier. Our approach makes two key contributions: (1) We demonstrate that performing ensemble learning across models trained on different genomic data--i.e., prokaryotic (Nucleotide Transformer v2, DNABERT-2) and bacteriophage (MegaDNA) genomes--captures partially orthogonal biological signals, yielding a 6% F1-score improvement over the best individual model. (2) We adapt long-context NLP aggregation strategies to handle whole bacterial and phage genomes (up to 5M base pairs) that exceed the foundation models' context windows (12-96K bp) by a factor of 50-100, a critical challenge largely unaddressed in prior genomic deep learning work. On the PredPHI benchmark, FoundedPBI achieves a 76% F1-score, outperforming the current state-of-the-art (PBIP) by 7%. On our internal dataset (CI4CB), we achieve a 93% F1-score, improving on our previous best method by 4%. These results demonstrate that ensemble learning with proper long-context handling enables effective knowledge transfer of genomic foundation models to specialized prediction tasks.
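The two mechanisms in the entry above (chunking genomes that exceed a model's context window, then concatenating per-model pooled embeddings into one meta-embedding) can be sketched as follows. The "models" below are hypothetical base-composition stand-ins for the real DNA language models, and the mean-pooling is one of several plausible aggregation choices.

```python
# Sketch of long-context handling plus ensemble meta-embeddings: chunk a
# genome to fit each model's context window, mean-pool per-chunk
# embeddings, then concatenate across models. Embedders are toy stand-ins.

def chunk(seq, window):
    return [seq[i:i + window] for i in range(0, len(seq), window)]

def toy_embed(seq, dim):
    # Stand-in embedding: per-chunk base composition (not a real model).
    return [seq.count(b) / max(len(seq), 1) for b in "ACGT"][:dim]

def meta_embedding(genome, models):
    """models: list of (context_window, dim), one per 'foundation model'."""
    out = []
    for window, dim in models:
        chunks = chunk(genome, window)
        pooled = [sum(v) / len(chunks)           # mean-pool over chunks
                  for v in zip(*(toy_embed(c, dim) for c in chunks))]
        out.extend(pooled)                       # concatenate across models
    return out

vec = meta_embedding("ACGT" * 1000, models=[(1000, 4), (500, 4)])
```

The resulting fixed-length vector, one slice per model, is what a downstream classifier would consume regardless of genome length.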